Conditional Kolmogorov Complexity and
Universal Probability



Paul M.B. Vitanyi 



Abstract 

The Coding Theorem of L.A. Levin connects unconditional prefix Kolmogorov complexity with 
the discrete universal distribution. There are conditional versions referred to in several publications but 
as yet there exist no written proofs in English. Here we provide those proofs. They use a different 
definition than the standard one for the conditional version of the discrete universal distribution. Under 
the classic definition of conditional probability, there is no conditional version of the Coding Theorem. 



I. Introduction 

Informally, the Kolmogorov complexity, or algorithmic entropy, of a string x is the length 
(number of bits) of a shortest binary program (string) to compute x on a fixed reference universal 
computer (such as a particular universal Turing machine). Intuitively, this quantity represents the 
minimal amount of information required to generate x by any effective process. The conditional 
Kolmogorov complexity of x relative to y is defined similarly as the length of a shortest binary 
program to compute x, if y is furnished as an auxiliary input to the computation [6]. 

The Coding Theorem (3) of L.A. Levin [8] connects a variant of Kolmogorov complexity, 
the unconditional prefix Kolmogorov complexity, with the discrete universal distribution. The 
negative logarithm of the latter is up to a constant equal to the former. The conditional in 
conditional Kolmogorov complexity commonly is taken to be a finite binary string. 

A conditional version of the Coding Theorem as referred to in [3], [9], [10], [4], [12] requires
a function denoted as $m(x|y)$ with $x, y \in \{0,1\}^*$ that is (i) lower semicomputable; (ii) satisfies
$\sum_x m(x|y) \le 1$ for every $y$; (iii) if $p(x|y)$ is a function satisfying (i) and (ii) then there is a
constant $c$ such that $c\,m(x|y) \ge p(x|y)$ for all $x$ and $y$. There is no written complete proof of the
conditional version of the Coding Theorem. Our aim is to provide such a proof and write it out
in detail rather than rely on "clearly" or "obviously." One wants to be certain that applications
of the conditional version of the Coding Theorem are well founded.

Since the discrete universal distribution $m$ over one variable is a semiprobability mass function,
that is $\sum_x m(x) \le 1$, it is natural to consider a universal distribution $m(x,y)$ over two variables
with $\sum_{x,y} m(x,y) \le 1$. One then can define the conditional version following the custom in
probability theory, for example [13],

$$ m(x|y) = \frac{m(x,y)}{\sum_z m(z,y)}. \tag{1} $$

But in [3], [9], [10], [4], [12] the conditional semiprobability m(x\y) is defined differently, 
namely as in Definition 4. In Theorem 1 for a single distribution, and in Theorem 2 for a joint 
distribution, it is shown that if one uses (1) then m(x\y) does not satisfy a Coding Theorem. 
In contrast, if m(x\y) is defined according to Definition 4 it does have a Coding Theorem, 
Theorem 4. 

The necessary notions and concepts are given in the Appendices: Appendix A introduces prefix
codes, Appendix B introduces Kolmogorov complexity, Appendix C introduces computability
notions, and Appendix D explains our use of $O(1)$.

A. Related Work

We can enumerate all lower semicomputable probability mass functions with one argument. 
For convenience these arguments are elements of $\{0,1\}^*$. The enumeration list is denoted

$$ \mathcal{P} = P_1, P_2, \ldots . $$

There is another interpretation possible. Let prefix Turing machine $T_i$ be the $i$th element in the
standard enumeration of prefix Turing machines $T_1, T_2, \ldots$. Then $R_i(x) = \sum 2^{-|p|}$ where $p$ is
a program for $T_i$ such that $T_i(p) = x$. This $R_i(x)$ is the probability that prefix Turing machine
$T_i$ outputs $x$ when the program on its input tape is supplied by flips of a fair coin. We can thus
form the list

$$ \mathcal{R} = R_1, R_2, \ldots . $$

Both lists $\mathcal{P}$ and $\mathcal{R}$ enumerate the same functions and there are computable isomorphisms
between the two [10] Lemma 4.3.4.

Definition 1. If $U$ is the reference universal prefix Turing machine, then the corresponding
distribution in the $\mathcal{R}$-list is $R_U$.

L.A. Levin [8] proved that

$$ m(x) = \sum_j \alpha_j P_j(x), \tag{2} $$

with $\sum_j \alpha_j \le 1$, $\alpha_j > 0$, and $\alpha_j$ lower semicomputable, is a universal lower semicomputable
semiprobability mass function. (For semiprobabilities see Appendix C.) That is, obviously it
is lower semicomputable and $\sum_x m(x) \le 1$. It is called a universal lower semicomputable
semiprobability mass function since (i) it is itself a lower semicomputable semiprobability mass
function and (ii) it multiplicatively (with factor $\alpha_j$) dominates every lower semicomputable
semiprobability mass function $P_j$.

Moreover, he proved the Coding Theorem

$$ -\log m(x) = -\log R_U(x) = K(x), \tag{3} $$

where equality holds up to a constant additive term.

B. Results 

We give a review of the classical definition of conditional probability versus the one used
in the case of semicomputable probability. In Sections III and IV we show that the conditional
version of (3) does not hold for the classic definition of conditional probability in the case of a
single probability distribution (Theorem 1) and for joint distributions (Theorem 2). In Section V
we consider Definition 4 of the conditional version of joint semicomputable semiprobability
mass functions as used in [3], [9], [10], [4], [12]. For this definition the conditional version of
(3) holds. We write all proofs out in complete detail.

II. Preliminaries 

Let $x, y, z \in \mathcal{N}$, where $\mathcal{N}$ denotes the natural numbers and we identify $\mathcal{N}$ and $\{0,1\}^*$
according to the correspondence

$$ (0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots $$

Here $\epsilon$ denotes the empty word. A string $x$ is an element of $\{0,1\}^*$. The length $|x|$ of $x$ is the
number of bits in $x$, not to be confused with the absolute value of a number. Thus, $|010| = 3$
and $|\epsilon| = 0$, while $|-3| = |3| = 3$.

The emphasis is on binary sequences only for convenience; observations in any alphabet can 
be so encoded in a way that is 'theory neutral'. Below we will use the natural numbers and the 
binary strings interchangeably. 
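
The correspondence between natural numbers and binary strings can be made concrete with a short sketch. The function names `nat_to_str` and `str_to_nat` are illustrative choices, not notation from the paper; the sketch assumes the length-increasing lexicographic order listed above.

```python
def nat_to_str(n: int) -> str:
    """Map 0, 1, 2, 3, 4, ... to '', '0', '1', '00', '01', ..."""
    if n < 0:
        raise ValueError("n must be a natural number")
    # bin(n + 1) looks like '0b1...'; drop the '0b' prefix and the leading 1.
    return bin(n + 1)[3:]


def str_to_nat(s: str) -> int:
    """Inverse map: put the leading 1 back and subtract 1."""
    return int("1" + s, 2) - 1


print([nat_to_str(n) for n in range(5)])
```

Writing n + 1 in binary and dropping the leading 1 gives exactly the listed pairs, which is why the two views can be used interchangeably.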

III. Conditional Probability 

Let $P$ be a probability mass function on sample space $\mathcal{N}$, that is, $\sum P(x) = 1$ where the
summation is over $\mathcal{N}$. Suppose we consider $x \in \mathcal{N}$ and event $B \subseteq \mathcal{N}$ has occurred. According
to Kolmogorov in [5] a new probability $P(x|B)$ has arisen satisfying:

1) $x \notin B$: $P(x|B) = 0$;

2) $x \in B$: $P(x|B) = P(x)/P(B)$;

3) $\sum_{x \in B} P(x|B) = 1$.

Let $m$ be as defined in (2) with the sample space $\mathcal{N}$. Then $\sum_x m(x) \le 1$ and we call $m$ a
semiprobability. For the conditional versions of semiprobabilities Items 1) and 2) above hold and
Item 3) holds with $\le$. We show that with these definitions there is no conditional Coding
Theorem.

Theorem 1. Let $B \subseteq \mathcal{N}$ and $|B| < \infty$. Then $-\log m(x|B) \ne K(x|B) + O(1)$.

Proof: ($x \notin B$) This implies $m(x|B) = 0$ and therefore $-\log m(x|B) = \infty$. But $K(x|B) < \infty$.

($x \in B$) We can replace $B$ by its characteristic string: $|\chi_B| = |B|$ and $\chi_B$ is defined by
$\chi_B(i) = 1$ if $i \in B$ and $\chi_B(i) = 0$ otherwise. Rewrite the conditional

$$ m(x|B) = \frac{m(x)}{m(B)} = \frac{m(x)}{m(\chi_B)}. $$

Then, applying the Coding Theorem on the single argument numerator and denominator of the
right-hand side,

$$ -\log m(x|B) = K(x) - K(\chi_B). $$

Let $K(\chi_B) \ge |B|$. For every $x \in B$ we have $K(x) \le \log |B| + O(\log\log |B|)$. Then, $-\log m(x|B) <
-|B|/2$. But for every $x$ and $B$ we have $K(x|B) \ge 0$. ■
IV. Lower Semicomputable Joint Probability

We show that there is no equivalent of the Coding Theorem for the conditional version of
$m$ according to (1) based on lower semicomputable joint probability mass functions. We use a
standard pairing function $\langle \cdot, \cdot \rangle$ to obtain two-argument (joint) lower semicomputable probability
mass functions from the single argument ones. For example, $\langle i, j \rangle = \frac{1}{2}(i + j)(i + j + 1) + j$.
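
As an illustration, the example pairing function $\langle i, j \rangle = \frac{1}{2}(i+j)(i+j+1) + j$ and its inverse can be sketched as follows. The names `pair` and `unpair` are hypothetical helpers introduced here; the inverse uses the standard integer square root trick and is not taken from the paper.

```python
import math


def pair(i: int, j: int) -> int:
    """Cantor-style pairing <i, j> = (i + j)(i + j + 1)/2 + j."""
    return (i + j) * (i + j + 1) // 2 + j


def unpair(n: int) -> tuple[int, int]:
    """Recover (i, j) from n = <i, j>."""
    # w = i + j is the largest integer with w(w + 1)/2 <= n.
    w = (math.isqrt(8 * n + 1) - 1) // 2
    j = n - w * (w + 1) // 2
    return w - j, j


print(pair(0, 1), unpair(2))
```

Because the map is a computable bijection between pairs and single numbers, a one-argument lower semicomputable function of $\langle x, y \rangle$ can be read as a two-argument one.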

Definition 2. Let $x, y \in \mathcal{N}$ and $f(\langle x, y \rangle)$ be a lower semicomputable function on a single
argument such that we have $\sum_{\langle x,y \rangle} f(\langle x, y \rangle) \le 1$. We use these functions $f$ to define the lower
semicomputable joint semiprobability mass functions $Q_j(x,y) = f(\langle x, y \rangle)$.

Let us define the list

$$ \mathcal{Q} = Q_1, Q_2, \ldots . $$

We can effectively enumerate the family of lower semicomputable joint semiprobability mass
functions as before by $\mathcal{Q}$. We can now define the lower semicomputable joint universal
probability by

$$ m(x,y) = \sum_j \alpha_j Q_j(x,y), \tag{4} $$

with $\sum_j \alpha_j \le 1$. Classically, for a joint probability mass function $P(x,y)$ with $x, y \in \mathcal{N}$ and
$\sum_{x,y} P(x,y) = 1$ one defines the conditional version [1] by

$$ P(x|y) = \frac{P(x,y)}{\sum_z P(z,y)}. $$

We call $P_1(x) = \sum_z P(x,z)$ and $P_2(y) = \sum_z P(z,y)$ the marginal probability of $x$ and $y$,
respectively. This form of conditional $P(x|y)$ corresponds with $P(x|B)$ in Section III in that
$B = \{(z,y) : z \in \mathcal{N}\}$. The semiprobability $m$ in (1) satisfies $\sum_{x,y} m(x,y) \le 1$ and the analogue
of the above yields

Definition 3. The conditional version of $m(x,y)$ is defined by

$$ m(x|y) = \frac{m(x,y)}{\sum_z m(z,y)} = \frac{\sum_j \alpha_j Q_j(x,y)}{\sum_z \sum_j \alpha_j Q_j(z,y)} = \frac{\sum_j \alpha_j Q_j(x,y)}{\sum_j \alpha_j \sum_z Q_j(z,y)}. $$
This conditional version $m(x|y)$ is the quotient of two lower semicomputable functions. It
may not be semicomputable (not proved here). We show that there is no conditional coding
theorem for this version of $m(x|y)$.

Theorem 2. Let $x, y \in \mathcal{N}$. Then, $-\log m(x|y) \ge K(x|y) + O(|y|)$. The $O(|y|)$ term in general
cannot be improved.

Proof: By (4) and the Coding Theorem we have $-\log m(x,y) = K(\langle x,y \rangle) + O(1)$. Clearly,
$K(\langle x,y \rangle) = K(x,y) + O(1)$. The marginal universal probability $m_2(y)$ is given by $m_2(y) =
\sum_z m(z,y) \ge m(\epsilon,y)$. Thus, by the Coding Theorem: $-\log m_2(y) \le
-\log m(\epsilon,y) = K(\langle \epsilon,y \rangle) + O(1) = K(y) + O(1)$. By the Symmetry of Information (9) we
find $K(x,y) = K(y) + K(x|y,K(y)) + O(1)$. Here $K(x|y,K(y)) = K(x|y) + O(\log |y|)$. Since
$m(x|y) = m(x,y)/m_2(y)$ by Definition 3, we have $-\log m(x|y) = -\log m(x,y) + \log m_2(y) \ge
-\log m(x,y) + \log m(\epsilon,y) = K(x|y) + O(\log |y|)$. Here the inequality follows from
the relation between $m_2(y)$ and $m(\epsilon,y)$, while the last equality follows from (9). In [3] it
is shown that for every length of the binary representation of $y \in \mathcal{N}$ there are $y$ such that
$K(x|y,K(y)) = K(x|y) + \Omega(\log |y|)$. ■

V. Lower Semicomputable Conditional Probability

We consider lower semicomputable conditional semiprobabilities directly in order to obtain
a conditional semiprobability that (i) is lower semicomputable itself, and (ii) multiplicatively
dominates every lower semicomputable conditional semiprobability. Let $f(x,y)$ be a lower
semicomputable function. We use these functions $f$ to define lower semicomputable conditional
semiprobability mass functions $P(x|y)$.

Theorem 3. There is a universal conditional lower semicomputable semiprobability mass 
function. We denote it by m. 

Proof: We prove the theorem in two stages. In Stage 1 we show that the two-argument lower
semicomputable functions which sum over the first argument to at most 1 can be effectively
enumerated as

$$ P_1, P_2, \ldots . $$

This enumeration contains all and only lower semicomputable conditional semiprobability mass
functions. In Stage 2 we show that $P_0$ as defined below multiplicatively dominates all $P_j$:

$$ P_0(x|y) = \sum_j \alpha_j P_j(x|y), $$

with $\sum_j \alpha_j \le 1$, and $\alpha_j > 0$ and lower semicomputable. Stage 1 consists of two parts. In the
first part, we enumerate all lower semicomputable two-argument functions; and in the second
part we effectively change the lower semicomputable two-argument functions to functions that
sum to at most 1 over the first argument. This change leaves the functions that were already
conditional lower semicomputable semiprobability mass functions unchanged.

Stage 1. Let $\psi_1, \psi_2, \ldots$ be an effective enumeration of all two-argument real-valued partial
recursive functions. For example, let $\psi_1(x,y), \psi_2(x,y), \ldots$ be $\psi_1(\langle x,y \rangle), \psi_2(\langle x,y \rangle), \ldots$ with
$\langle \cdot, \cdot \rangle$ the standard pairing function over the natural numbers. Consider a function $\psi$ from this
enumeration (where we drop the subscript for notational convenience). Without loss of generality,
assume that each $\psi$ is approximated by a rational-valued three-argument partial recursive function
$\phi'(x,y,k) = p/q$ (use $\phi'(\langle \langle x,y \rangle, k \rangle) = \langle p,q \rangle$). Without loss of generality, each such $\phi'$ is
modified to a partial recursive function $\phi$ satisfying the properties below. For all $x, y, k \in \mathcal{N}$,

• if $\phi(x,y,k) < \infty$, then also $\phi(x,y,1), \phi(x,y,2), \ldots, \phi(x,y,k-1) < \infty$ (this can be
achieved by the trick of dovetailing the computation of $\phi'(\langle \langle x,y \rangle, 1 \rangle), \phi'(\langle \langle x,y \rangle, 2 \rangle), \ldots$
and assigning computed values in enumeration order of halting to $\phi(x,y,1), \phi(x,y,2), \ldots$);

• $\phi(x,y,k+1) \ge \phi(x,y,k)$ (dovetail the computation of $\phi'(x,y,1), \phi'(x,y,2), \ldots$ and assign
the enumerated values to $\phi(x,y,1), \phi(x,y,2), \ldots$ satisfying this requirement and ignoring
the other computed values); and

• $\lim_{k \to \infty} \phi(x,y,k) = \psi(x,y)$ (as does $\phi'$).

The resulting $\psi$-list contains all lower semicomputable two-argument real-valued functions, and
is represented by the approximators in the $\phi$-list. Each lower semicomputable function $\psi$ (rather,
the approximating function $\phi$) will be used to construct a function $P$ that sums to at most 1
over the first argument. In the algorithm below, the local variable array $P$ contains the current
approximation to the values of $P$ at each stage of the computation. This is doable because the
nonzero part of the approximation is always finite.

Step 1: Initialize by setting $P(x|y) := 0$ for all $x, y \in \mathcal{N}$; and set $k := 0$.

Step 2: Set $k := k + 1$, and compute $\phi(1,1,k), \ldots, \phi(k,k,k)$. {If any $\phi(i,j,k)$, $1 \le i, j \le k$,
is undefined, then the existing values of $P$ do not change.}

Step 3: if for some $j$ ($1 \le j \le k$) we have $\phi(1,j,k) + \cdots + \phi(k,j,k) > 1$ then the
existing values of $P$ do not change else for $i, j := 1, \ldots, k$ set $P(i|j) := \phi(i,j,k)$
{Step 3 is a test of whether the new assignment of $P$-values satisfies (also future) lower
semicomputable conditional semiprobability mass function requirements}

Step 4: Go to Step 2.

If $\psi(x,y)$ satisfies $\sum_x \psi(x,y) \le 1$ for all $x, y \in \mathcal{N}$ then $P(x|y) = \psi(x,y)$ for all $x, y \in \mathcal{N}$.
If for some $x, y$ and $k$ with $x, y \le k$ the value $\phi(x,y,k)$ is undefined, then the last assigned
values of $P$ do not change any more even though the computation goes on forever. If the else
condition in Step 3 is satisfied in the limit with equality by the values of $P$, it is a conditional
semiprobability mass function. If the if condition in Step 3 gets satisfied, then the computation
terminates and $P$'s support is finite and it is computable.
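
Steps 1 to 4 can be sketched in code under simplifying assumptions. In this sketch `phi` is an ordinary Python function returning an exact `Fraction`, or `None` to model an undefined value; a real partial recursive function would simply never return, so the `None` convention, the `rounds` cutoff, and the names `build_P`, `phi_good`, and `phi_bad` are all hypothetical devices for illustration.

```python
from fractions import Fraction


def build_P(phi, rounds):
    """Run Steps 1-4 for a finite number of rounds and return the table P."""
    P = {}  # Step 1: every unlisted P(i|j) is 0
    for k in range(1, rounds + 1):  # Step 2: k := k + 1
        vals = {(i, j): phi(i, j, k)
                for i in range(1, k + 1) for j in range(1, k + 1)}
        if any(v is None for v in vals.values()):
            break  # an undefined value: P stays frozen from here on
        # Step 3: test the sum over the first argument for each j.
        if any(sum(vals[(i, j)] for i in range(1, k + 1)) > 1
               for j in range(1, k + 1)):
            break  # phi is nondecreasing in k, so the test fails forever after
        P.update(vals)  # else-branch: P(i|j) := phi(i, j, k)
    return P


def phi_good(i, j, k):
    # Nondecreasing in k, converging to 2^-i; the sums over i stay below 1.
    return Fraction(1, 2**i) * (1 - Fraction(1, 2**k))


def phi_bad(i, j, k):
    # Grows without bound, so the Step 3 test eventually fails.
    return Fraction(k, 2)


print(build_P(phi_good, 5)[(1, 1)], build_P(phi_bad, 5))
```

The `break` replaces the endless loop of Step 4 in the two cases where the text says the values of $P$ no longer change, which leaves the resulting table the same.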

Executing this procedure on all functions in the list $\phi_1, \phi_2, \ldots$ yields an effective enumeration
$P_1, P_2, \ldots$ of lower semicomputable functions containing all and only lower semicomputable
conditional semiprobability mass functions. The algorithm takes care that for all $j \ge 1$ we have

$$ \sum_x P_j(x|y) \le 1. $$

Stage 2. Define the function $P_0$ as

$$ P_0(x|y) = \sum_j \alpha_j P_j(x|y), $$

with $\alpha_j$ chosen such that $\sum_j \alpha_j \le 1$, $\alpha_j > 0$ and lower semicomputable for all $j$. Then $P_0$ is a
conditional semiprobability mass function since

$$ \sum_x P_0(x|y) = \sum_j \alpha_j \sum_x P_j(x|y) \le \sum_j \alpha_j \le 1. $$

The function $P_0(\cdot|\cdot)$ is also lower semicomputable, since $P_j(x|y)$ is lower semicomputable in $j$
and $x, y$. (Use the universal partial recursive function $\phi_0$ and the construction above.) Also $\alpha_j$
is by definition lower semicomputable for all $j$. Finally, $P_0$ multiplicatively dominates each $P_j$
since for all $x, y \in \mathcal{N}$ we have $P_0(x|y) \ge \alpha_j P_j(x|y)$ while $\alpha_j > 0$. Therefore, $P_0$ is a universal
lower semicomputable conditional semiprobability mass function. ■
We can choose the $\alpha_j$'s in the definition of $P_0$ in the proof above by setting

$$ \alpha_j = 2^{-K(j) - c_j} $$

with the $c_j > 0$ constants. Then $\sum_j \alpha_j \le 1$ by the ubiquitous Kraft inequality [7] (satisfied by
the prefix complexity $K$), and $\alpha_j > 0$ and lower semicomputable for all $j$.



Definition 4. We define

$$ m(x|y) = \sum_{j \ge 1} 2^{-K(j) - c_j} P_j(x|y). $$

We call $m(x|y)$ the reference universal lower semicomputable conditional semiprobability mass
function.

Corollary 1. If $P(x|y)$ is a lower semicomputable conditional semiprobability mass function,
then $2^{K(P)} m(x|y) \ge P(x|y)$, for all $x, y$. That is, $m(x|y)$ multiplicatively dominates every lower
semicomputable conditional semiprobability mass function $P(x|y)$.

A. A Priori Probability 

Let $P_1, P_2, \ldots$ be the effective enumeration of all lower semicomputable conditional
semiprobability mass functions constructed in Theorem 3. There is another way to effectively enumerate
all lower semicomputable conditional semiprobability mass functions. Let the input to a prefix
machine $T$ (with the string $y$ on its auxiliary tape) be provided by an infinitely long sequence of
fair coin flips. The probability of generating an initial input segment $p$ is $2^{-|p|}$. If $T(p,y) < \infty$,
that is, $T$'s computation on $p$ with $y$ on its auxiliary tape terminates, then presented with any
infinitely long sequence starting with $p$, the machine $T$ with $y$ on its auxiliary tape, being a
prefix machine, will read exactly $p$ and no further.



Let $T_1, T_2, \ldots$ be the standard enumeration of prefix machines in [10]. For each prefix machine
$T$, define

$$ Q_T(x|y) = \sum_{T(p,y) = x} 2^{-|p|}. \tag{5} $$

In other words, $Q_T(x|y)$ is the probability that $T$ with $y$ on its auxiliary tape computes output
$x$ if its input is provided by successive tosses of a fair coin. This means that for every string $y$
we have that $Q_T$ satisfies

$$ \sum_x Q_T(x|y) \le 1. $$
We can approximate $Q_T(\cdot|y)$ for every string $y$ as follows. (The algorithm uses the local variable
$Q(x)$ to store the current approximation to $Q_T(x|y)$.)

Step 1: Fix $y \in \{0,1\}^*$. Initialize $Q(x) := 0$ for all $x$.

Step 2: Dovetail the running of all programs on $T$ with auxiliary $y$ so that in stage $k$, step
$k - j$ of program $j$ is executed. Every time the computation of some program $p$ halts
with output $x$, increment $Q(x) := Q(x) + 2^{-|p|}$.

The algorithm approximates the displayed sum in Equation 5 by the contents of $Q(x)$. Since
$Q(x)$ is nondecreasing, this shows that $Q_T$ is lower semicomputable. Starting from a standard
enumeration of prefix machines $T_1, T_2, \ldots$, this construction gives for every $y \in \{0,1\}^*$ an
enumeration of only lower semicomputable conditional probability mass functions

$$ Q_1(\cdot|y), Q_2(\cdot|y), \ldots . $$

To merge the enumerations for different $y$ we use dovetailing over the index $i$ of $Q_i$ and $y$. The
$P$-enumeration of Theorem 3 contains all elements enumerated by this $Q$-enumeration. In [10]
Lemma 4.3.4 the reverse is shown.
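
The two-step approximation of $Q_T(\cdot|y)$ can be illustrated on a toy machine. Everything about `toy_machine` is an assumption made for this sketch: it halts only on programs of the form $1^n 0$ (a prefix-free set), takes $n + 1$ steps, and outputs $n$ copies of $y$. A real prefix machine would be simulated step by step; here the halting time is simply looked up.

```python
from collections import defaultdict
from fractions import Fraction


def toy_machine(p, y):
    """Hypothetical prefix machine: returns (halting_time, output) or None."""
    if p.endswith("0") and "0" not in p[:-1]:
        n = len(p) - 1
        return n + 1, y * n
    return None  # every other program runs forever


def program(j):
    """The j-th binary string (j = 1, 2, ...): '', '0', '1', '00', ..."""
    return bin(j)[3:]


def approximate_Q(machine, y, stages):
    """Dovetail: in stage k, program j (j < k) executes its step k - j."""
    Q = defaultdict(Fraction)  # Step 1: Q(x) := 0 for all x
    for k in range(1, stages + 1):
        for j in range(1, k):
            p = program(j)
            result = machine(p, y)
            if result is not None and result[0] == k - j:
                # Program p halts in this stage with output result[1].
                Q[result[1]] += Fraction(1, 2 ** len(p))
    return dict(Q)


print(approximate_Q(toy_machine, "01", 20))
```

Running more stages can only add mass to `Q`, which mirrors the lower semicomputability argument: the approximation is nondecreasing and its sum over outputs never exceeds 1.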

Definition 5. The conditional universal a priori probability on the positive integers is defined
as

$$ Q_U(x|y) = \sum_{U(p,y) = x} 2^{-|p|}, $$

where $U$ is the reference prefix machine.

Remark 1. The use of prefix machines in the present discussion rather than plain Turing
machines is necessary. By Kraft's inequality the series $\sum_p 2^{-|p|}$ converges (to $\le 1$) if the
summation is taken over all halting programs $p$ of any fixed prefix machine with a fixed auxiliary
input $y$. In contrast, if the summation is taken over all halting programs $p$ of a universal plain
Turing machine, then the series $\sum_p 2^{-|p|}$ diverges.
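
A small numerical illustration of the contrast in Remark 1 follows. It does not simulate a universal machine: the prefix-free set $\{1^n 0\}$ and the set of all nonempty strings up to a given length are stand-ins chosen for this sketch (the latter would be the halting set of a hypothetical plain machine that halts on every input), and `kraft_sum` is an illustrative name.

```python
from fractions import Fraction
from itertools import product


def kraft_sum(words):
    """Sum of 2^-|w| over the given words."""
    return sum(Fraction(1, 2 ** len(w)) for w in words)


# A prefix-free set: no word is a proper prefix of another.
prefix_free = ["1" * n + "0" for n in range(20)]

# All nonempty strings of length at most 5: not prefix-free.
all_strings = ["".join(bits) for n in range(1, 6)
               for bits in product("01", repeat=n)]

print(kraft_sum(prefix_free), kraft_sum(all_strings))
```

Each length contributes exactly 1 to the second sum, so it grows without bound as longer strings are admitted, while the prefix-free sum stays below 1.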

B. The Conditional Coding Theorem 

Theorem 4. There is a constant $c$ such that for every $x$,

$$ \log \frac{1}{m(x|y)} = \log \frac{1}{Q_U(x|y)} = K(x|y), $$

with equality up to an additive constant $c$.

Proof: Since $2^{-K(x|y)}$ represents the contribution to $Q_U(x|y)$ by a shortest program for $x$
given the auxiliary $y$, we have $2^{-K(x|y)} \le Q_U(x|y)$, for all $x, y$.

Clearly, $Q_U(x|y)$ is lower semicomputable. Namely, enumerate all programs for $x$ given $y$, by
running reference machine $U$ on all programs with $y$ as auxiliary at once in dovetail fashion: in
the first phase, execute step 1 of program 1; in the second phase, execute step 2 of program 1
and step 1 of program 2; in the $i$th phase ($i > 2$), execute step $j$ of program $k$ for all positive $j$
and $k$ such that $j + k = i + 1$. By the universality of $m(x|y)$ in the class of lower semicomputable
conditional semiprobability mass functions, $Q_U(x|y) = O(m(x|y))$.

It remains to show that $m(x|y) = O(2^{-K(x|y)})$. This is equivalent to proving that $K(x|y) \le
\log 1/m(x|y) + O(1)$, as follows. Exhibit a prefix-code $E$ encoding each source word $x$ given
$y$ as a code word $E(x|y) = p$, satisfying

$$ |p| \le \log \frac{1}{m(x|y)} + O(1), $$

together with a decoding prefix machine $T$ such that $T(p,y) = x$. Then,

$$ K_T(x|y) \le |p|, $$

and by the Invariance Theorem (7)

$$ K(x|y) \le K_T(x|y) + c_T, $$

with $c_T > 0$ a constant that may depend on $T$ but not on $x, y$. Note that $T$ is fixed by the
above construction. On the way to constructing $E$ as required, we recall a construction for the
Shannon-Fano code:

Lemma 1. If $p$ is a function on the nonnegative integers, and $\sum_x p(x) \le 1$, then there is a binary
prefix-code $e$ such that the code words $e(1), e(2), \ldots$ can be length-increasing lexicographically
ordered and $|e(x)| \le \log 1/p(x) + 2$.

Proof: Let $[0,1)$ be the half-open real unit interval, corresponding to the sample space $S =
\{0,1\}^\infty$. Each element $\omega$ of $S$ corresponds to a real number $0.\omega$. Let $x \in \{0,1\}^*$. The half-open
interval $[0.x, 0.x + 2^{-|x|})$ corresponding to the cylinder (set) of reals $\Gamma_x = \{0.\omega : \omega = x \ldots \in S\}$
is called a binary interval. We cut off disjoint, consecutive, adjacent (not necessarily binary)
intervals $I_x$ of length $p(x)$ from the left end of $[0,1)$, $x = 1, 2, \ldots$. Let $i_x$ be the length of the
longest binary interval contained in $I_x$. Set $e(x)$ equal to the binary word corresponding to the
leftmost such interval. Then $|e(x)| = \log 1/i_x$. It is easy to see that $I_x$ is covered by at most
four binary intervals of length $i_x$, from which the lemma follows. ■
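
The interval-cutting construction in the proof of Lemma 1 can be sketched for a finite list of rational probabilities. The function name `shannon_fano` and the restriction to a finite list of positive `Fraction` values are assumptions of this sketch; the search for the longest binary interval simply tries lengths $2^{-l}$ for $l = 0, 1, 2, \ldots$.

```python
from fractions import Fraction
from math import ceil


def shannon_fano(probs):
    """Return code words e(1), e(2), ... for positive probabilities summing to <= 1."""
    codes = []
    left = Fraction(0)
    for p in probs:
        right = left + p  # I_x = [left, right)
        l = 0
        while True:
            # Leftmost binary interval [m/2^l, (m+1)/2^l) starting inside I_x.
            m = ceil(left * 2**l)
            if Fraction(m + 1, 2**l) <= right:
                break  # it fits, and no longer binary interval does
            l += 1
        # The code word is m written with exactly l bits.
        codes.append(format(m, "b").zfill(l) if l > 0 else "")
        left = right
    return codes


print(shannon_fano([Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))
print(shannon_fano([Fraction(1, 3)] * 3))
```

Because the intervals $I_x$ are disjoint, the binary intervals chosen inside them are disjoint too, which is what makes the resulting code prefix-free.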

We use this construction to find a prefix machine $T$ such that $K_T(x|y) \le \log 1/m(x|y) + c$.
That $m(x|y)$ is not computable but only lower semicomputable results in $c = 3$.

Since $m(x|y)$ is lower semicomputable, there is a partial recursive function $\phi(x,y,t)$ with
$\phi(x,y,t) \le m(x|y)$ and $\phi(x,y,t+1) \ge \phi(x,y,t)$, for all $t$. Moreover, $\lim_{t \to \infty} \phi(x,y,t) =
m(x|y)$. Let $\psi(x,y,t)$ be the greatest partial recursive lower bound of the following special
form on $\phi(x,y,t)$ defined by

$$ \psi(x,y,t) := \{2^{-k} : 2^{-k} \le \phi(x,y,t) < 2 \cdot 2^{-k} \text{ and } \phi(x,y,j) < 2^{-k} \text{ for all } j < t\}, $$

and $\psi(x,y,t) := 0$ otherwise. Let $\psi$ enumerate its range without repetition. Then, for every fixed $y$,

$$ \sum_{x,t} \psi(x,y,t) = \sum_x \sum_t \psi(x,y,t) \le \sum_x 2 m(x|y) \le 2. $$

For fixed $x$ and $y$, the series $\sum_t \psi(x,y,t)$ can converge to precisely $2m(x|y)$ only in case there
is a positive integer $k$ such that $m(x|y) = 2^{-k}$.

In a manner similar to the above proof we chop off consecutive, adjacent, disjoint half-open
intervals $I_{x,y,t}$ of length $\psi(x,y,t)/2$, in enumeration order of a dovetailed computation of all
$\psi(x,y,t)$, starting from the left-hand side of $[0,1)$. We have already shown that this is possible.
It is easy to see that we can construct a prefix machine $T$ as follows: If $\Gamma_p$ is the leftmost largest
binary interval of $I_{x,y,t}$, then $T(p,y) = x$. Otherwise, $T(p,y) = \infty$ ($T$ does not halt).

By construction of $\psi$, for each pair $x, y$ there is a $t$ such that $\psi(x,y,t) > m(x|y)/2$. Each
interval $I_{x,y,t}$ has length $\psi(x,y,t)/2$. Each $I$-interval contains a binary interval $\Gamma_p$ of length at
least one-half of that of $I$ (because the length of $I$ is of the form $2^{-k}$, it contains a binary
interval of length $2^{-k-1}$). Therefore, there is a $p$ with $T(p,y) = x$ such that $2^{-|p|} \ge m(x|y)/8$.
This implies $K_T(x|y) \le \log 1/m(x|y) + 3$, which was what we had to prove. ■

Corollary 2. The above result plus Corollary 1 give: If $P$ is a lower semicomputable
conditional semiprobability mass function, then there is a constant $c_P = K(P) + O(1)$ such that
$K(x|y) \le \log 1/P(x|y) + c_P$.
VI. Conclusion

The conditional version of the Coding Theorem of L.A. Levin, Theorem 4, requires a lower
semicomputable conditional semiprobability that multiplicatively dominates all other lower
semicomputable conditional semiprobabilities as in Theorem 3. With the conventional form of the
conditional (1), applied to the distribution (2) that satisfies the original Coding Theorem (3), a
conditional version of the Coding Theorem does not hold. This is shown by Theorems 1 and 2.

Appendix 

A. Self-delimiting Code 

A binary string $y$ is a proper prefix of a binary string $x$ if we can write $x = yz$ for $z \ne \epsilon$.
A set $\{x, y, \ldots\} \subseteq \{0,1\}^*$ is prefix-free if for any pair of distinct elements in the set neither is
a proper prefix of the other. A prefix-free set is also called a prefix code and its elements are
called code words. An example of a prefix code, that is useful later, encodes the source word
$x = x_1 x_2 \cdots x_n$ by the code word

$$ \bar{x} = 1^n 0 x. $$

This prefix-free code is called self-delimiting, because there is a fixed computer program associated
with this code that can determine where the code word $\bar{x}$ ends by reading it from left to right
without backing up. This way a composite code message can be parsed in its constituent code
words in one pass, by the computer program. (This desirable property holds for every prefix-free
encoding of a finite set of source words, but not for every prefix-free encoding of an infinite set
of source words. For a single finite computer program to be able to parse a code message the
encoding needs to have a certain uniformity property like the $\bar{x}$ code.) Since we use the natural
numbers and the binary strings interchangeably, $|\bar{x}|$ where $x$ is ostensibly an integer, means the
length in bits of the self-delimiting code of the binary string with index $x$. On the other hand,
$\overline{|x|}$ where $x$ is ostensibly a binary string, means the self-delimiting code of the binary string
with index the length $|x|$ of $x$. Using this code we define the standard self-delimiting code for $x$
to be $x' = \overline{|x|}x$. It is easy to check that $|\bar{x}| = 2n + 1$ and $|x'| = n + 2 \log n + 1$. Let $\langle \cdot \rangle$ denote a
standard invertible effective one-one encoding from $\mathcal{N} \times \mathcal{N}$ to a subset of $\mathcal{N}$. For example, we
can set $\langle x, y \rangle = x'y$. We can iterate this process to define $\langle x, \langle y, z \rangle \rangle$, and so on. For Kolmogorov
complexity it is essential that there exists a pairing function such that the length of $\langle u, v \rangle$ is
equal to the sum of the lengths of $u, v$ plus a small value depending only on $|u|$.
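
The two self-delimiting codes can be sketched as follows. The helper names (`bar`, `prime`, `decode_bar`, `decode_prime`) are illustrative, and the length $|x|$ is turned into a binary string through the number-to-string correspondence of Section II; both choices are assumptions of this sketch rather than definitions taken from the paper.

```python
def nat_to_str(n):
    """Natural number to binary string: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    return bin(n + 1)[3:]


def str_to_nat(s):
    return int("1" + s, 2) - 1


def bar(x):
    """Code word 1^n 0 x for a string x of length n."""
    return "1" * len(x) + "0" + x


def prime(x):
    """Standard self-delimiting code: the bar code of the length of x, then x."""
    return bar(nat_to_str(len(x))) + x


def decode_bar(stream):
    """Read one bar code word from the left; return (x, rest of stream)."""
    n = stream.index("0")  # number of leading 1s
    return stream[n + 1 : 2 * n + 1], stream[2 * n + 1 :]


def decode_prime(stream):
    """Read one prime code word from the left; return (x, rest of stream)."""
    length_str, rest = decode_bar(stream)
    n = str_to_nat(length_str)
    return rest[:n], rest[n:]


print(bar("010"), prime("010"), decode_prime(prime("010") + "1"))
```

Decoding never looks past the end of the code word, so a concatenation such as $x'y$ can be split into $x$ and $y$ in one left-to-right pass.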

B. Kolmogorov Complexity 

For precise definitions, notation, and results see the text [10]. For technical reasons we use 
a variant of complexity, so-called prefix complexity, which is associated with Turing machines 
for which the set of programs resulting in a halting computation is prefix free. We realize prefix 
complexity by considering a special type of Turing machine with a one-way input tape, a separate 
work tape, and a one-way output tape. Such Turing machines are called prefix Turing machines. 
If a machine T halts with output x after having scanned all of p on the input tape, but not further, 
then $T(p) = x$ and we call $p$ a program for $T$. It is easy to see that $\{p : T(p) = x, x \in \{0,1\}^*\}$
is a prefix code.

Let $T_1, T_2, \ldots$ be a standard enumeration of all prefix Turing machines with a binary input tape,
for example the lexicographical length-increasing ordered prefix Turing machine descriptions
[10]. Let $\phi_1, \phi_2, \ldots$ be the enumeration of corresponding prefix functions that are computed by
the respective prefix Turing machines ($T_i$ computes $\phi_i$). These functions are the partial recursive
functions or computable functions (of effectively prefix-free encoded arguments). We denote the
function computed by a Turing machine $T_i$ with $p$ as input and $y$ as conditional information by
$\phi_i(p,y)$. One of the main achievements of the theory of computation is that the enumeration
$T_1, T_2, \ldots$ contains a machine, say $T_u$, that is computationally universal and optimal in that it can
simulate the computation of every machine in the enumeration when provided with its program
and index. Namely, it computes a function $\phi_u$ such that $\phi_u(\langle i,p \rangle, y) = \phi_i(p,y)$ for all $i, p, y$. We
fix one such machine and designate it as the reference universal Turing machine or reference
Turing machine for short.

Definition 6. The conditional prefix Kolmogorov complexity of $x$ given $y$ (as auxiliary
information) with respect to prefix Turing machine $T_i$ is

$$ K_i(x|y) = \min_p \{|p| : \phi_i(p,y) = x\}. \tag{6} $$

The conditional prefix Kolmogorov complexity $K(x|y)$ is defined as the conditional Kolmogorov
complexity $K_u(x|y)$ with respect to the reference prefix Turing machine $T_u$ usually denoted by
$U$. The unconditional version is set to $K(x) = K(x|\epsilon)$.
The prefix Kolmogorov complexity $K(x|y)$ satisfies the following so-called Invariance
Theorem:

$$ K(x|y) \le K_i(x|y) + c_i \tag{7} $$

for all $i, x, y$, where $c_i$ depends only on $i$ (asymptotically, the reference machine is not worse than
any other machine). Intuitively, $K(x|y)$ represents the minimal amount of information required
to generate $x$ by any effective process from input $y$ (provided the set of programs is
prefix-free). The functions $K(\cdot)$ and $K(\cdot|\cdot)$, though defined in terms of a particular machine model,
are machine-independent up to an additive constant and acquire an asymptotically universal and
absolute character through Church's thesis, see for example [10], and from the ability of universal
machines to simulate one another and execute any effective process.

Quantitatively, $K(x) \le |x| + 2 \log |x| + O(1)$. A prominent property of the prefix-freeness of
$K(x)$ is that we can interpret $2^{-K(x)}$ as a probability distribution since $K(x)$ is the length of a
shortest prefix-free program for $x$. By the fundamental Kraft's inequality [7] (see for example
[1], [10]) we know that if $l_1, l_2, \ldots$ are the code-word lengths of a prefix code, then $\sum_x 2^{-l_x} \le 1$.
Hence,

$$ \sum_x 2^{-K(x)} \le 1. \tag{8} $$

This leads to the notion of universal distribution $m(x) = 2^{-K(x)}$ which we may view as a
rigorous form of Occam's razor. Namely, the probability $m(x)$ is great if $x$ is simple ($K(x)$ is
small like $K(x) = O(\log |x|)$) and $m(x)$ is small if $x$ is complex ($K(x)$ is large like $K(x) \ge |x|$).

The Kolmogorov complexity of an individual finite object was introduced by Kolmogorov [6]
as an absolute and objective quantification of the amount of information in it. The information
theory of Shannon [13], on the other hand, deals with average information to communicate
objects produced by a random source. Since the former theory is much more precise, it is
surprising that analogs of theorems in information theory hold for Kolmogorov complexity, be
it in somewhat weaker form. An example is the remarkable symmetry of information property
used later, see [15] for the plain complexity version, and [3] for the prefix complexity version
below. Let $x^*$ denote the shortest prefix-free program $x^*$ for a finite string $x$, or, if there are
more than one of these, then $x^*$ is the first one halting in a fixed standard enumeration of all
halting programs. Then, by definition, $K(x) = |x^*|$. Denote $K(x,y) = K(\langle x,y \rangle)$. Then,

$$ K(x,y) = K(x) + K(y|x^*) + O(1) = K(y) + K(x|y^*) + O(1). \tag{9} $$

Remark 2. The information contained in $x^*$ in the conditional above is the same as the
information in the pair $(x, K(x))$, up to an additive constant, since there are recursive functions
$f$ and $g$ such that for all $x$ we have $f(x^*) = (x, K(x))$ and $g(x, K(x)) = x^*$. On input $x^*$, the
function $f$ computes $x = U(x^*)$ and $K(x) = |x^*|$; and on input $x, K(x)$ the function $g$ runs all
programs of length $K(x)$ simultaneously, round-robin fashion, until the first program computing
$x$ halts; this is by definition $x^*$.

C. Computability Notions 

If a function has as values pairs of nonnegative integers, such as (a, b), then we can interpret 
this value as the rational a/b. We assume the notion of a computable function with rational 
arguments and values. A function $f(x)$ with $x$ rational is semicomputable from below if it is
defined by a rational-valued total computable function $\phi(x,k)$ with $x$ a rational number and $k$ a
nonnegative integer such that $\phi(x,k+1) \ge \phi(x,k)$ for every $k$ and $\lim_{k \to \infty} \phi(x,k) = f(x)$. This
means that $f$ (with possibly real values) can be computably approximated arbitrarily close from
below (see [10], p. 35). A function $f$ is semicomputable from above if $-f$ is semicomputable
from below. If a function is both semicomputable from below and semicomputable from above
then it is computable.

We now consider a subclass of the lower semicomputable functions. A function $f$ is a
semiprobability mass function if $\sum_x f(x) \le 1$ and it is a probability mass function if $\sum_x f(x) =
1$. It is customary to write $p(x)$ for $f(x)$ if the function involved is a semiprobability mass
function.

D. Precision 

It is customary in this area to use "additive constant c" or equivalently "additive 0(1) term" 
to mean a constant, accounting for the length of a fixed binary program, independent from 
every variable or parameter in the expression in which it occurs. In this paper we use the prefix 
complexity variant of Kolmogorov complexity for convenience. Prefix complexity of a string
exceeds the plain complexity of that string by at most an additive term that is logarithmic in the
length of that string.